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Abstract. Consider the problem of estimating the entries of a large 
matrix, when the observed entries are noisy versions of a small random 
fraction of the original entries. This problem has received widespread 
attention in recent times, especially after the pioneering works of Em- 
manuel Candes and collaborators. Typically, it is assumed that the 
underlying matrix has low rank. This paper introduces a simple estima- 
tion procedure, called Universal Singular Value Thresholding (USVT), 
that works for any matrix that has 'a little bit of structure'. In partic- 
ular, the matrix need not be of low rank. The procedure is very simple 
and fast, works under minimal assumptions, and is applicable for very 
large matrices. Surprisingly, this simple estimator achieves the mini- 
max error rate up to a constant factor. The method is applied to give 
simple solutions to difficult questions in low rank matrix estimation, 
blockmodels, distance matrix completion, latent space models, positive 
definite matrix completion, problems related to graph limits, and gen- 
eralized Bradley- Terry models for pairwise comparison. 
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1. Introduction 

Consider a statistical estimation problem where the unknown parameter 
is not a single value or vector, but an m x n matrix M. Given an estimator 
M, one choice for a measure of the error in estimation is the mean-squared 
error, defined as 



(1) MSE(M):=E 



EEC 



mn 

1=1 j 



Here mjj and denote the (i,j)th elements of M and M, respectively. If 
we have a sequence of such problems, and M„ and M„ denote the parameter 
and the estimator in the nth problem, then by usual statistical terminology 
we may say that the sequence of estimators M„ is consistent if 

lim MSE(M„) = 0. 
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To fix ideas, let us begin by posing a list of challenging matrix estimation 
problems. The background and literature for these problems will be dis- 
cussed in detail in Section HI 



Problem 1. (Low rank matrices.) Suppose that Xij, i = 1, . . . ,m, j = 
1, . . . , n is a bunch of independent random variables, modeled as 

where e^j's have mean zero and mjj's are unknown constants. Additionally, 
suppose that each Xij is observed with probability p and unobserved with 
probability 1—p, where p is some number in [0, 1]. Note that p is allowed to 
be quite small, depending on m and n. Assume that the Xij's are uniformly 
bounded above and below by fixed constants. In general, it is impossible to 
estimate the m-jj's using the Xij^s, since there is exactly one observation for 
each parameter. However, if the rank r of the matrix M = (772-ij)i<i<m, i<j<n 
is sufficiently small, one may hope to accurately estimate M from the ob- 
served Xjj's. The question is: What is a necessary and sufficient condition, 
in terms of r, m, n and p, under which this estimation problem is solvable? 
Low rank matrix estimation has been tackled in a vast number of papers in 
recent times [ini[Illiai88l|9a[2S[25l[28l[H[Z9l[65l[Ml^ 



will consider this problem in Section 4.1 as the simplest application of our 
technique. 



Problem 2. (Stochastic blockmodels.) Consider an undirected graph on 
n vertices. A stochastic blockmodel assumes that the vertices 1, . . . ,n are 
partitioned into k blocks, and the probability that a vertex i is connected 
to vertex j by an edge depends on the blocks to which i and j belong. 
The edges are independent of each other. Let M be the matrix whose 
(i,j)th element is the probability of an edge existing between vertices i 
and j. If the number of blocks is comparable to n, then one cannot hope 
to estimate the edge probabilities. The question is: Given that we do not 
know which vertex belongs to which block, or the number of blocks, is it 
possible to accurately estimate M (in mean-squared error) from a single 
realization of the graph, simply under the condition that k is small compared 
to n? The stochastic blockmodel has received a lot of attention over several 
decades [Ml ESI Ell |371 [111 [HI EQl [El EHl EHl above question 



has not been completely resolved. We will solve it in Section 4.2 



Problem 3. (Distance matrix completion.) Suppose that we have a com- 
pact metric space K with metric d. Let xi, . . . , Xn he n points from K. Let 
Mn be the matrix whose {i,j)th. entry is d{xi,Xj). Now suppose that we do 
not observe M„, but only a small fraction of its entries. More precisely, sup- 
pose that there is a number p £ [0, 1] such that each entry of M„ is observed 
with probability p and unobserved with probability 1—p, independent of 
other entries. The set of observed entries of Mn is our data. The question 
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is: Given that we do not know anything about the metric space K or the 
metric d (except that it is compact), or the points xi, . . . ,Xn, is it possible 
to consistently estimate Mn as n ^ oo in the sense of mean squared error? 
Distance matrix completion has been a popular topic in the social sciences, 
multidimensional scaling and various other fields [TTl [71 [THI EH |Ml IS] 
but a proper statistical theory has never been developed. In the Euclidean 
case it is solvable by low rank methods, but general distance matrix com- 
pletion is an unsolved problem. We will give the solution of this problem in 
Section 14.31 



Problem 4. (Latent space models.) Suppose that our data is an undirected 
graph on n vertices. The following is an example of a latent space model for 
this data: To each vertex is attached a latent (or hidden) variable Pi, which is 
a vector in a compact set K CM.^ where k is the dimension of the parameter 
space. Assume that the edges are independent, and the probability that 
there exists an edge between two vertices i and j is f{j3i,j3j), where / : 
IT X — )• M is a continuous function. The parameter in this model is the 
n X n matrix M„ whose («,j)th entry is f{f3i,f3j). The question is: Given 
that we do not know the function f, the parameters /3i, . . . ,f3n, the region 
K , or even the dimension k, is it still possible to consistently estimate the 
matrix Mn in the sense of mean-squared error as n — )• oo ? Latent space 
models are used extensively in the world of network analysis, but rigorous 
results are scarce [571 113 EZl [551 El 131] . In particular, not much is known 
about how to estimate the parameters if the function / is unknown. We will 



solve this general problem in Section 4.4 



Problem 5. (Positive definite matrix completion.) Suppose that we have a 
positive semidefinite matrix M„ of order n with all entries bounded by some 
fixed constant, and only a random fraction p of its entries are observed. 
Is it possible to estimate the matrix Mn without assuming any additional 
condition such as low rank? What is the fastest that p can tend to zero 
as n ^ oo so that it is possible to estimate Mn consistently? Indeed, it 
is not obvious that M„ may be consistently estimated even if p remains 
fixed. Positive definite matrix completion has been a popular topic in linear 
algebra [53l [Ml [E] and more recently, in statistical matrix completion [25l 
23 [2B] in the context of covariance matrices. However, the completion of 
positive definite matrices that are not necessarily of low rank is an open 



problem. We will solve it in Section 4.5 



Problem 6. (Estimating graphons.) Suppose that / : [0, 1]^ — t- [0, 1] is a 
measurable function and C/i, . . . , C/„ are i.i.d. Uniform[0, 1] random variables. 
Suppose that a random graph G„ on n vertices is constructed by putting 
an edge between vertices i and j with probability f{Ui,Uj). In modern 
graph theory, a function such as / is called a 'graphon' and G„ is called a 
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graph 'sampled from the graphon /'. Let Mn be the random matrix whose 
(i,j)th element is f{Ui,Uj). The question is: Given that we do not know 
the function f nor anything about the values of Ui, . . . ,Un, is it possible to 
consistently estimate M„ by only observing the graph Gn ? This is similar 
to Problem 4, with the difference that the function / is assumed to be only 
measurable; continuity is not assumed. The study of graphons has been 
popular in the recent graph theory literature [20l [211 ESI E31 El] a-^id has 
connections to the study of weakly exchangeable arrays [IJl |9l EJ [59] . They 
have also appeared recently in large deviations j32l [Ml [ZS] and mathematical 
statistics \30\ [86] . The question of estimating a graphon using a single graph 
sampled from the graphon has not been tackled in the literature, perhaps 
because of the impossible-looking nature of the problem. We will solve this 
question in Section [4. 6[ 



Problem 7. (Non-parametric Bradley- Terry model.) Suppose there are n 
teams playing against each other in a tournament. Every team plays against 
every other team exactly once. Suppose that pij is the probability that team 
i wins against team j in a game between i and j. The Bradley- Terry model 
assumes that pij is of the form ai/{ai + aj) for some unknown nonnegative 
numbers ai, . . . , an- It is known how to estimate the parameters ai, . . . ,an 
if we assume that the outcomes of all games are independent. Now sup- 
pose that we drop the assumption about the particular form of pij in terms 
of hidden parameters and make the 'non-parametric assumption' that the 
teams have a particular ordering in terms of strength that is unknown to 
the observer, and that if team i is stronger than team j, then pik > pjk 
for all k 7^ i,j. Note that the parametric model is a special case of the 
non-parametric version. There are several questions: In the non-parametric 
Bradley-Terry model, is it possible to estimate all the pij 's from a tourna- 
ment where every team plays against every other exactly once? Is it possible 
to estimate the pij 's if only a randomly chosen fraction of the games are 
played? Is the estimation possible even if the fraction p tends to zero as 
n — )• DO ? The Bradley- Terry model has a long history and a huge body of 
literature [23 [Ml [ini[Sni[llHZl[H3[Zniinni[77l[3Il[nil[Ml[S3]. It is therefore 
somewhat surprising that no one has considered the natural non-parametric 
version described above. We will show in Section 4.7 how all parameters may 
be simultaneously estimated in the non-parametric Bradley- Terry model. 



It turns out that all of the estimation problems posed above can be solved, 
without any additional assumption, by a single method. The goal of the 
next section is to introduce a simple matrix estimation procedure capable of 
solving all of the above problems and potentially much more. The procedure 
will be applied to solve the posed problems in Section [4] 
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2. The USVT algorithm and estimator 

2.1. The setup. Suppose that we have a m x n matrix M, where m < n 
and the entries of M are bounded by 1 in absolute value. Let X be a matrix 
whose elements are independent random variables, and E(xjj) = m^- for all 
i and j (where, as usual, xij and rriij denote the (i,j)th entries of X and 
M respectively). Assume that the entries of X are also bounded by 1 in 
absolute value, with probability one. A matrix such as X will henceforth 
be called a 'data matrix with mean M\ The matrix M will sometimes be 
called the 'parameter matrix'. Let p be a real number belonging to the 
interval [0, 1]. Suppose that each entry of X is observed with probability p, 
and unobserved with probability 1 — p, independently of the other entries. 

The above model will henceforth be referred to as the 'asymmetric model'. 
The 'symmetric model' is defined in a similar manner: Take any n and let 
M be a symmetric matrix of order n, whose entries are bounded by 1 in 
absolute value. Let X be a symmetric random matrix of order n whose 
elements on and above the diagonal are independent, and K{xij) = nriij for 
all 1 < 2 < J < n. As before, assume that the entries of X are almost surely 
bounded by 1 in absolute value. Take any p £ [0, 1] and suppose that each 
entry of X on and above the diagonal is observed with probability p, and 
unobserved with probability 1 — p, independently of the other entries. 

Similarly, one can define the 'skew-symmetric model', where the difference 
X—M is skew-symmetric, with independence on and above the diagonal as in 
the symmetric model. This model is used for analyzing the non-parametric 



Bradley- Terry model in Section 4.7 



2.2. The USVT estimator. In all of the above models, we construct an 
estimator M of M based on the observed entries of X along the follow- 
ing steps. The author tentatively calls this the Universal Singular Value 
Thresholding (USVT) algorithm. 

1. For each let yij = Xij if Xij is observed, and let yij = if Xij is 
unobserved. Let Y be the matrix whose (i,j)th entry is yij. 

2. Let Y = Yl'iLi SiUivJ be the singular value decomposition of Y . (In the 
symmetric and skew-symmetric models, m = n.) 

3. Let p be the proportion of observed values of X. In the symmetric and 
skew-symmetric models, let p be the proportion of observed values on 
and above the diagonal. 

4. Let 

S := {i:si> 2.02y^}. 

[The constant 2.02 is not that important; it is simply an arbitrary choice 
strictly bigger than 2. The algorithm may be implemented with any 
constant strictly bigger than 2. Moreover, if it is known that Var(xij) < 
o"^ for all where o" is a known constant < 1, then the threshold 
2.02y/np may be improved to 2.02\/ng, where q := pa^ +p(l — p)(l — o"^)- 
For example, if the entries of X are known to belong to the interval [0, 1] 
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instead of [—1,1], then VaT{xij) < 1/4: for all i, j , so if p = 1 then in this 
case 2.02 may be replaced by 1.01.] 

5. Define 

W := - SiUivJ. 
^ i&S 

6. Let Wij denote the (i,j)th element of W. Define 

Wij if — 1 < Wij < 1, 
1 if Wij > 1, 
—1 if Wij < —1. 

If the entries of M and X are known to belong to the interval [0, 1] instead 
of [—1, 1], let rhij = when Wij < 0. If X = M with probability one, let 
rhij = Xij when Xij is an observed value. 

7. Let M be the matrix whose (i,j)th entry is rhij. 

2.3. The main theorem. Recall that the nuclear norm of M, written 
jjiVf]],,,, is defined as the sum of the singular values of M. Recall also the 
definition ([T]) of the mean squared error of a matrix estimator. The following 
theorem gives an error bound for the estimator M in terms of the nuclear 
norm of M. This is the main result of this paper. 

Theorem 2.1. Let M and M be as above. LetMSE{M) be defined as in Q. 
Suppose that p > n~^~^'' for some e > 0. Then 

MSE(M) <Cmin|^^ + — ij + C7(g)e-'="P, 
[ m^Jnp np mn J 



where C and c are positive universal constants and C(e) depends only on e. 
The same result holds for the symmetric and skew- symmetric models, after 
putting m = n. Moreover, in the case where we know that Var(xij) < cr^ 
for all i,j for some known < \, and the threshold is set at 2.02-v/ng 
(see Step 4 of the algorithm), the same result holds under the condition that 
q > n^^^*^, where q := pa'^ +p(l — p)(l — cr^). In this case the exponential 
term in the error changes to C{e)e~'^'^'^ . 

Incidentally, the proof shows that the condition p > n~^^'' may be im- 



proved to p > n~^(logn)^^'^ (see Theorem 8.4), but the author prefers to 
retain the present version for aesthetic reasons, especially considering that 
it is not a real improvement from any practical point of view. 

It should be emphasized that although singular value thresholding has 
been used in a number of papers on matrix completion and estimation (e.g. 
[TUl [U [211 ESI EH])) the above algorithm has the unique feature that the 
threshold is universal. In the literature, it is usually assumed that the matrix 
M has a rank r that is known, and uses the value of r while thresholding. 
The USVT algorithm manages to cut off the singular values at the 'correct' 
level, depending on the structure of the unknown parameter matrix. The 



8 



SOURAV CHATTERJEE 



adaptiveness of the USVT threshold is somewhat shnilar in spirit to that of 
the SureShrink algorithm of Donoho and Johnstone |l5] . 

A second remark is that if M or X have entries that have absolute value 
greater than 1, one can easily deal with the situation by scaling M and X 
by an appropriate constant. Similarly, if the number of rows of M is greater 
than the number of columns, we work with the transposed matrices M'^ and 
X'^ instead of M and X, so that the condition m < n is satisfied. 

2.4. The minimax lower bound. It is not difficult to prove that for an 
mxn matrix M with entries bounded by 1 in absolute value, where m < n, 
the nuclear norm is bounded by m^/n. Given a number 5 S [0,■m^/n\, one 
may take an arbitrary estimator M and try to find the M among all M 
satisfying ||-/Vf||^, < 5 for which MSE(M) is maximum. Recall that an esti- 
mator that minimizes this maximum error is classically known as a minimax 
estimator. The following theorem shows that our estimator M is minimax 
up to a constant multiplicative factor and an exponentially small additive 
discrepancy. 

Theorem 2.2. Consider the general matrix estimation problem outlined in 
the beginning of this section. Given any estimator M and any 6 G [0,m^/n\, 
there exists M satisfying < 6 and a data matrix X with mean M, 

such that for this M and X , the estimator M satisfies 

MSE(M) > c mini + — , — , ll, 

[ m^Jrvp np mn J 

where c is a positive universal constant. Moreover, if p < 1/2 then X and 
M may be chosen such that X = M . The same lower bound holds in the 
symmetric case and in the skew-symmetric case. 

It is worth noting that the exponentially small discrepancy is necessary. 
For example, if 5 = 0, then the minimax error is obviously zero. However, 
there is still an exponentially small chance that M may be nonzero. It is also 
worth noting that if 6 is not too small (e.g. if 6 > ^/m/p), then the expo- 
nential discrepancy does not matter, and the combination of Theorems |2.1| 
and |2.2| gives the correct minimax error up to a universal multiplicative 
constant. 

2.5. A numerical result. Although many examples will be worked out in 
Section |4] and extensive simulations results will be presented in Section [TJ 
we present the output from one set of simulations right here in this section 
to satisfy the reader's curiosity about the efficacy of the algorithm. The 
model that was simulated is a special case of the general class of latent 
space models discussed in Section [TJ Suppose that a random graph with 
independent edges is defined on n vertices, where the probability that vertex 
i is joined to vertex j by an edge is mij, where mij obeys the logistic model 

777 ■ ' 

(2) log -^^= (3, + (3j + a/3il3j. 

171 ■■ 
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(a) n = 200. 



(b) n = 1000. 



Figure 1. Plot of rhij vs. rriij in a simulated latent space 
model, for a randomly chosen subset of entries of size 200 
(see Section 2.5). 



Here a, /3i, . . . , are unknown parameters. The simulations were carried 
out with n = 200 and n = 1000, and a, /3i, . . . , /3„ were generated as i.i.d. 
Uniform [0, 1] random variables. A random graph from the above model was 
generated, and the USVT algorithm was applied to the adjacency matrix 
of this random graph to estimate the m^j's. Figure [T] gives the plot of rhij 
versus rriij as i,j vary over a subset of entries of size 200, chosen at random. 
(A random subset of size 200 was chosen because the set of all entries is too 
large to be suitable for visualization.) 

Incidentally, when the interaction parameter a is zero, the model is the 
same as the 'beta model' considered in [3T]. In that case it is known how 
to compute the maximum likelihood estimates of the /3i's. However when 
a 7^ 0, it is not clear how to estimate the parameters by existing methods. 
Figure [T] indicates that the USVT algorithm does a pretty good job even 
with medium sized graphs. This is quite remarkable, considering that the 
USVT algorithm does not 'know' that the graph comes from the logistic 
model Q; it is a fully non-parametric procedure. 

This example belongs to a class of examples widely considered in this 
manuscript: The data is a random graph with independent edges, and the 
probability that there is an edge joining vertex i to vertex j is equal to 
rriij, the (i,j)th element of a parameter matrix M. By definition, M is 



symmetric. Let Ai, . . . , A„ be the eigenvalues of M. Theorems 2.1 and 2.2 



show that M is estimable from the data if and only if Yl^=i l^^il small 
compared to n^/^, and the USVT algorithm gives a way of estimating M if 
this condition holds. We will see in Section |43] that this condition is satisfied 
not only for the logistic model considered here, but for essentially all latent 
space models. 
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2.6. Dependent entries. One may wonder whether the USVT estimate 
works if we drop the assumption of independence. It turns out that estima- 
tion is stih possible, provided we have an idea about the size of \\X — M\\, 
where — M|| denotes the spectral norm (i.e. the largest singular value) of 
the matrix X — M. The modified procedure described below is never used 
in this article but is recorded for future purposes. 

Suppose that we have an unknown matrix M of order m x n, where 
m < n. Let X be a random matrix. Assume that all the entries of X and 
M are bounded by 1 in absolute value, as before. Again, as before, suppose 
that each entry of X is observed with probability p and unobserved with 
probability 1 — p, independently of other entries. 

Assume for simplicity that p is known. Let Y be defined as in Step 1 of 
the USVT algorithm. Let 7 and e be two (known) numbers such that 

P(||p"^y-M|| > 7) < e. 

Modify the USVT algorithm as follows. Let all steps be the same, except 
that in Step 4, define S := {i : Si > I.OI7}. Then we have the following 
theorem. 

Theorem 2.3. In the above setting, 

MSE(M) < ^^Mk + ce, 
mn 

where C is a universal constant. 



2.7. Plan of the paper. Theorem 2.1 will be proved in Section^ The- 



orem [2]2] in Section |9j and Theorem 2.3 in Section 10 Section [4] contains 



applications of Theorem |2.1| to solve the problems posed in Section[T| Section 
[3] compares Theorem |2.1| to the existing literature. Section [6] shows that it is 
impossible to estimate the mean squared error of the USVT estimate with- 
out making assumptions about the parameter matrix. Section [5] examines 
the situation where the data matrix X is not assumed to follow any model 
and discoveres an interesting connection between the singular value thresh- 



olding and Szemeredi's regularity lemma from graph theory. Section 11 at 



the end of the manuscript contains a summary and discussion of the paper. 
3. Comparison with the existing literature 

The problem of estimating the entries of a large matrix from incomplete 
and/or noisy entries has received widespread attention ever since the pro- 
liferation of large data sets. Early work using spectral analysis was done 
by a number of authors in the computer science literature, for example by 
Azar et. al. |TD] and Achlioptas and McSherry pQ. This was followed by 
a sizable body of work on spectral methods, the main pointers to which 
may be found in the important recent papers of Keshavan, Montanari and 
Oh [65l|66j. Non-spectral methods also appeared, e.g. in [88]. 

In a different direction, statisticians have worked on matrix completion 
problems under a variety of modeling assumptions. Possibly the earliest 
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works are due to Fazel ^46j and Rudelson and Vershynin [92j . The emergence 
of compressed sensing |44| [27] has led to an explosion in activity in the field 
of matrix estimation and completion, beginning with the work of Candes and 
Recht [26j. The pioneering works of Emmanuel Candes and his collaborators 
[26, 28, 25, 24j introduced the technique of matrix completion by minimizing 
the nuclear norm under convex constraints, which is a convex optimization 
problem tractable by standard algorithms. This method has the advantage 
of exactly, rather than approximately, recovering the entries of the matrix 
when a suitable low rank assumption is satisfied, together with a certain 
other assumption called 'incoherence'. 

Since the publication of [26], a number of statistics papers have attacked 
the matrix completion problem from various angles. Some notable examples 
are [79l [811 [681 IMl [SZl [38]. In a different direction, a paper that seems to 
have a close bearing on the analytical aspects of this paper is a manuscript 
of Oliveira [84]. 

In addition to the theoretical advances, a large number of algorithms for 
matrix completion and estimation have emerged. The main ones are nicely 
summarized and compared in [81j. 

The following list gives some of the key points in the comparison of this 
paper with the existing literature on matrix estimation and completion. 



1. The literature mainly deals with low rank matrix completion and esti- 
mation. Some of the algorithms require the user to know the rank a 
priori (e.g. |65| 166] ). whereas some others do not have this requirement 
(e.g. [26]). Theorem 2.1 does not require low rank, neither does it require 
adaptive estimation of the rank or any other quantity. 

2. There are some exceptions to the low rank assumption, for example the 
work of Negahban and Wainwright [79] and that of Davenport et. al. [38] . 
The paper [38] is particularly relevant for us, since it gives the error 



bound in terms of the nuclear norm, just like Theorem 2.1 In fact, the 



authors of [38j have essentially proved the minimax rate given by the 



combination of Theorems 2.1 and 2.2 However, although [38J gives an 



error rate comparable with Theorem 2.1 the algorithm proposed in [3^ 
is a complicated optimization problem that is sometimes (but not always) 
convex. 

3. As mentioned above, the idea of singular value thresholding has appeared 
before [10^ [H [65] [66j . but these methods work only for low rank matri- 
ces and require the user to know the rank of the matrix a priori. Our 
algorithm does not have these requirements. Incidentally, the idea of 
replacing missing entries by zeros has appeared earlier in [HB] . 

4. The algorithm proposed in this paper is possibly the simplest and fastest 
among all of the algorithms available in the literature. It involves no iter- 
ations. Only one singular value decomposition is enough. The truncation 
threshold does not require any adaptive estimation or any knowledge 
about the underlying matrix. The method is applicable for very large 
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matrices, where the methods based on convex optimization are possibly 
too slow to be useful. 

5. There are no assumptions in Theorem |2.1| that are complicated or ar- 
tificial. The only assumptions required are that the matrix entries are 
uniformly bounded and independent, and that the observed entries are 
chosen at random. However one should note that it may be possible to 
get stronger results by making more assumptions. For example, the as- 
sumption of incoherence in the papers of Candes and coauthors \26\ [28] 
allows exact recovery of low rank matrices, whereas the minimal assump- 
tion setting of Theorem 2.1 allows only approximate recovery. On the 
other hand, Theorem 2.1 goes beyond low rank matrices. 

6. This paper has many examples that are worked out in complete detail. 
This, again, diverges from the standard practice: Most papers simply 
indicate a list of areas and examples where the techniques may be appli- 
cable. However, there are some exceptions such as |79j. 

4. Examples 

Throughout this section, m, n, M, X, p and M will be as in Section [2] 
Just to remind the reader, M is an m x n matrix where 1 < m < n. The 
entries of M are assumed to be bounded by 1 in absolute value. The matrix 
X is a random matrix whose entries are independent, and the (i, j)th element 
Xij has expected value equal to rriij, the {i,j)th entry of M. Moreover, they 
satisfy \xij\ < 1 with probability one. In particular, X may be exactly equal 
to M, with no randomnness. Each entry of X is observed with probability 
p and unobserved with probability 1 — p, independently of other entries. 
Occasionally, we will assume the symmetric model, where m = n, and the 
matrices M and X are symmetric. In the special case of the Bradley- Terry 



model in Section 4.7, we will assume the skew-symmetric model, where 



X — M is skew-symmetric. 

We will now work out various specific cases where Theorem 2.1 gives 
useful results. 

4.1. Low rank matrices. As mentioned in Section [3j estimating low rank 
matrices has been the focus of the vast majority of prior work [TUl [H HHl IHSl 



IHSllSnillSlllHlIMlESllMllMllMlEZlI^ Theorem [2I] works for low rank 
matrices. The following theorem, which is a simple corollary of Theorem 2.1 
shows that M is a good estimate whenever the rank of M is small compared 
to mp (after assuming, as in Theorem 2.1 that p > 



Theorem 4.1. Suppose that M has rank r. Suppose that p > n ^"'"'^ for 
some e > 0. Then 



MSE(M) < CmmL[^+ — , 1 1 + C(e)e-'='^''. 
[ y mp up J 

where C and c are universal constants and C(e) depends only on e. More- 
over, the same result holds when M and X are symmetric. 
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Proof. Recall that the Frobenius norm of M is defined as 

(m n X 1/2 

EE-'.) • 
i=i j=i ^ 

Note that ||M|||;, is also the sum of squares of the singular values of M. The 
number of nonzero singular values is equal to the rank of M. Therefore by 
the Cauchy-Schwarz inequality, 

(3) ||M||, < Vrank(M)||M||j., 



and therefore < ^rmn. The result now follows from Theorem 2.1 □ 



The term 1/np in the error bound is necessary to take care of the case 
r = 0. Even if M is identically zero, the estimator M will incur some error 
due to the (possible) randomness in X. 

Let us now inspect how the condition r <^ mp compares with available 
bounds. Keshavan, Montanari and Oh |65| [66] obtain the same condition. 



but only if m and n are comparable. Theorem 4.1 , on the other hand, works 
for 'very rectangular' matrices also, where m <^n. Moreover, the approach 
of Keshavan et. al. requires the user to know the rank a priori. 

Candes and Tao [2H] obtain the condition r ^ mp with an extra poly- 
logarithmic term in the error. Moreover they too require that m and n be 
comparable, and additionally they need the so-called 'incoherence condition'. 
However, as noted before, the incoherence condition allows exact recovery, 
while our approach only gives approximate recovery. 

The recent work of Davenport et. al. |38j gives an estimator with an error 



bound that is almost the same as that given by Theorem 4.1, but with a 
complicated optimization algorithm. 

The following theorem shows that the condition r <C mp is necessary for 
estimating M. 

Theorem 4.2. Given any estimator M , there exists an mx n matrix M of 
rank r with entries hounded between —1 and 1, such that when the data is 
sampled from M , 

MSE(M) > C(l 

where C is a positive universal constant and [m/r] is the integer part ofm/r. 

Proof. Let M be an m x n random matrix whose first r rows consist of i.i.d. 
Uniform[— 1, 1] random variables, and copy this block [m/r] times. Declare 
the remaining rows, if any, to be zero. Then note that M has rank < r. 

Let D be our data, that is, the observed values of M. One can imagine 
D as a matrix whose (i, j)th entry is mij if m^- is observed, and a question 
mark if mij is unobserved. For any belonging to the nonzero portion 
of the matrix M, M contains [m/r] copies of mij. Since the M- value at 
the location of each copy is observed with probability p, independent of the 
other copies, the chance that none of these copies are observed is equal to 
none of the copies are observed, then the data contains no 
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information about rriij. Using this, it is not difficult to write down a formal 
argument that shows 

E(Var(m,j | D)) > C7(l -p)'™/''! 

where C is some universal constant. On the other hand, since rhij is a 
function of D, the definition of variance implies that 

E((mjj — rriij)'^ \ D) > Var(mij | D). 

Combining the last two displays, we see that 

This completes the proof. □ 

4.2. The stochastic blockmodel. Consider an undirected graph on n ver- 
tices. A stochastic blockmodel assumes that the vertices 1, . . . ,n are par- 
titioned into k blocks, and the probability that vertex i is connected to 
vertex j by an edge depends only on the blocks to which i and j belong. 
As usual, edges are independent of each other. Let M be the matrix whose 
(i, j)the element is the probability of an edge existing between vertices i and 
j. The matrix X here is the adjacency matrix of the observed graph. Here 
all elements of X are observed, so p = 1. 

This is commonly known as the stochastic blockmodel. It was introduced 
by Holland, Laskey and Leinhardt [58] as a simple stochastic model of social 
networks. It has become one of the most successful and widely used models 
for community structure in networks, especially after the advent of large 
data sets. 

Early analysis of the stochastic blockmodel was carried out by Snijders 
and Nowicki \96\ [82], who provided consistent parameter estimates when 
there are exactly two blocks. This was extended to a finite but fixed number 
of blocks of equal size by Condon and Karp [37]. Bickel and Chen [16J were 
the first to give consistent estimates for finite number of blocks of unequal 
size. It was observed by Leskovec et. al. [Hj that in real data, the number 
of blocks often seem to grow with the number of nodes. This situation was 
rigorously analyzed for the first time in Rohe et. al. [90], and was followed 
up shortly thereafter by [TTl [36l [TH [M] with more advanced results. 

However, all in all, the author is not aware of any estimator for the sto- 
chastic blockmodel that works whenever the number of blocks is small com- 
pared to the number of nodes. The best result till date is in the very recent 
manuscript of Rohe et. al. [M], who prove that a penalized likelihood estima- 
tor works whenever k is comparable to n 'up to log factors'. The following 
theorem says that the USVT estimator M gives a complete solution to the 
estimation problem in the stochastic blockmodel if A; <C n, with no further 
conditions required. (The method will not work very well for sparse graphs, 
however; for recent advances on estimation in sparse graphs, see [35].) 
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Theorem 4.3. For a stochastic blockmodel with k blocks, 

MSE(M) < cJ-, 
V n 

where C is a universal constant. 

Proof. If two vertices i and j are in the same block, then the ith and jth 
rows of M are identical. Therefore M has at most k distinct rows and so the 
rank of Af is < /c. An application of Theorem |4.1| completes the proof. □ 

Note that estimating the stochastic blockmodel is a special case of low 
rank matrix estimation with noise. It is not difficult to prove that the 
estimation problem is impossible when k is of the same order as n. We will 
not bother to write down a formal proof. 

4.3. Distance matrices. Suppose that K is a compact metric space with 
metric d. Let xi, . . . , x„ be arbitrary points from K, and let M be the nxn 
matrix whose (z, j)th entry is d{xi,Xj). Such matrices are called 'distance 
matrices'. Since X is a compact metric space, the diameter of K with 
respect to the metric d must be finite. Scaling d by a constant factor, we 
may assume without loss of generality that the diameter is bounded by 1, 
so that the entries of M are bounded by 1 as required by Theorem |2.1[ 

Completing a distance matrix with missing entries has been a popular 
problem in the engineering and social sciences for a long time; see, for exam- 
ple, [971 [HI El EHl [Ml [95j . It has become particularly relevant in engineering 
problems related to sensor networks. It is also an important issue in multi- 
dimensional scaling |19| . For some recent theoretical advances, see |831 162j. 

In general, distance matrices need not be of low rank. Therefore much 
of the literature on matrix estimation and completion does not apply to 
distance matrices. Surprisingly, Theorem |2.1| gives a complete solution of 
the distance matrix completion and estimation problem. 

Theorem 4.4. Suppose that p > n"^^'' for some e > 0. If M is a distance 
matrix as above, then 

MSE(M) < "^'""^ + C(e)e-"'^P, 

where c is a universal constant, C(e) depends only on e, and C{K,d,n) is 
number depending only on K, d and n such that 

lim C{K,d,n) = 0. 

n—¥oo 

The above theorem is not wholly satisfactory, since it does not indicate 
how fast p can go to zero as n — )• oo so that M is still consistent. To 
understand that, we need to know more about the structure of the space K. 
The following theorem gives a quantitative estimate. 
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Theorem 4.5. Suppose that for each 6 > 0, N{6) is a number such that K 
may he covered by N(6) open d-balls of radius 5. Then 



MSE(M) < Cinf mill 



........... ^±v;MfV^,lUc(e)e-- 

where C and c are universal constants and C(e) depends only on e. 



To see how Theorem 4^ may be used, suppose that K is a compact subset 
of the real hue and d is the usual distance on M, scaled by a factor to ensure 
that the diameter of -ftT is < 1. Then N{6) increases like 1/(5 as (5 —t- 0. 
Consequently, given n, the optimal choice of 6 is of the order n~^^^, which 
gives the bound 

MSE(M) < — . 

(Note that the exponential term need not appear because the main term is 
bounded below by a positive universal constant if p < n~^/'^.) Thus, M is a 
consistent estimate as long as p goes to zero slower than n~^/'^ as n — t- oo. 



The proofs of Theorems 4.4 and 4.5 follow from a more general lemma that 



will also be useful later for other purposes. Suppose that S = {xi, . . . ,Xn} 
is a finite set and / : S x S ^ 1] is an arbitrary function. Suppose 
that for each 6 > 0, there exists a partition V{6) of S such that whenever 
x,y,x',y' are four points in S such that x,x' £ P for some P £ V^S) and 
y,y' G Q for some Q G V{6), then \ f{x,y) - f{x',y')\ < 5. Let M be the 
nx n matrix whose (i,j)th element is f{xi,Xj). 

Lemma 4.6. In the above setting, 



MSE(M) < Cinf min|^±V^52!^ , i| +c(6)e— ^ 

where C and c are universal constants, and C{e) depends only on e. 

Proof. Fix some (5 > 0. Let T be a subset of S consisting of exactly one 
point from each member of V{5). For each x £ S, let p{x) be the unique 
element of T such that x and p{x) belong to the same element of V{5). Let 
N be the matrix whose {i,j)th. element is f{p{xi),p{xj)). Then 

n 

\\M - N\\j, = ifi^uxj) - f{p{xi),p{x,))f < nH\ 

By the triangle inequality for the nuclear norm, the inequality Q and the 
above inequality, 

||M||, < \\M -N\\^ + ||7V||, 

< y/n\\M - N\\f + IIA^II* 

< n^l'^5+\\N\U. 
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Now, if Xi and xj belong to the same element of V{6), then p{xi) = p{xj) 
and hence the ith and jth rows of are identical. This shows that has 
at most 1 7^ (5) I distinct rows and therefore has rank < |7^((5)|. Therefore by 
the inequality ([s]), 

||iV||* < V\V{5)\ \\N\\f < V[P(6)\n. 
The proof is completed by applying Theorem |2.1[ □ 



Using Lemma |4.6[ it is easy to prove Theorems 4.4 and 4.5 



Proof of Theorem 4.5 . Let all notation be as in Theorem 4.5 To apply 
Lemma 4.6, let S be the set {xi, . . . ,Xn}- From the definition of N{6), it 
is easy to see that there is a partition V{5) of S of size < N{6/4:), such 
that any two points belonging to the same element of the partition are at 
distance < 6/2 from each other. Consequently, if x,x' & P and y,y' & Q for 
some P,Q £ 'PiS), then by the triangle inequality for the metric d, 

\d{x,y) - d{x',y')\ < \d{x,y) - d{x',y)\ + \d{x',y) - d{x',y')\ 

< d{x,x') + d{y,y') < 6. 

Putting / = d in Lemma |4.6[ the proof is complete. □ 



Proof of Theorem 4.4, Since K is compact, there exists a finite number N{6) 



for each (5 > such that K may be covered by N{6) open d-balls of radius 6. 



By Theorem 4.5 this shows that for any sequence 6n decreasing to zero, 



MSE(M) < Cmin|^^i±^^Z?)Z^, l} +C(e)e-^. 

To complete the proof, choose dn going to zero so slowly that A^(5„/4) = o{n) 
as n — )• oo. □ 

4.4. Latent space models. Suppose that /3i, . . . , /3„ are vectors belonging 
to some bounded closed set C M*^, where k is some arbitrary but fixed 
dimension. Let / : if — ?■ [— 1, 1] be a continuous function. Let M be the 
n X n matrix whose {i,j)th element is f{(3i,(3j). Then our data matrix X 
has the form 

where eij are independent errors with zero mean, satisfying the restriction 
that \xij\ < 1 almost surely. For example, X may be the adjacency matrix of 
a random graph where the probability of an edge existing between vertices i 
and j is f{/3i,/3j). This is one context where latent space models are widely 
used, starting with the work of Hoff, Raftery and Handcock [S7|. A large 
body of work applying the latent space approach to real data has grown in 
the last decade. On the theoretical side, it was observed in [16| [T7] that 
the latent space model arises naturally from an exchangeability assumption 
due to the Aldous-Hoover theorem ^ i59j . Note that distance matrices and 
stochastic blockmodels are both special cases of latent space models. 
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There have been various attempts to estimate parameters in the latent 
space models (e.g. [571 |55l E] ) . Almost all of these approaches rely on heuris- 
tic arguments and justification through simulations, without any kind of 
proof whatsoever. The problem is that in addition to the vectors /3i, . . . , 
the function / itself is an unknown parameter. If either /3i's are known, or / 
is known, the estimation problem is tractable. For example, when f{x,y) is 
of the form e^"'"^/(l + e^"*"^), the problem was solved in [31]. However when 
both / and /3j's are unknown, the problem becomes seemingly intractable. 
In particular, there is an identifiability issue because f{x, y) may be replaced 
by h{x,y) := fig{x), g{y)) and /3j by g~^{(3i) for any invertible function g 
without altering the model. 

In view of the above discussion, it is a rather surprising consequence of 



Theorem 2.1 that it is possible to estimate the numbers /(/3j, /3j), i,j = 
1, . . . ,n from a single realization of the data matrix, under no additional 
assumptions than the stated ones. 



Theorem 4.7. Suppose that p > n If M is as above, then 
MSE(M) < '^(^'^'Z'") + C7(e)e-c"P, 

where c is a universal constant, C(e) depends only on e, and C{K, k, f,n) 
is depends only on K , k, f and n such that 

lim C{K,k,f,n) = 0. 



Proof. We will apply Lemma 4.6, Let S be the set {/3i, . . . ,/?„}• Since / 
is continuous on K and K is compact, / must be uniformly continuous. 
This shows that for each 6 > we can find a partition 'P{6) of S satisfying 



the condition required for Lemma 4.6, such that the size of V{6) may be 



bounded by a constant N{6) depending only on K, k, f and 6. Choosing 



6n ^ slowly enough so that N{5n/^) = o{n) and applying Lemma 4.6 



completes the proof. □ 



The problem with Theorem 4.7, just like Theorem 4.4 in Section [4.3[ is 
that it does not give an explicit error bound, which makes it impossible to 
determine how fast p can go to zero with n so that consistency holds. Again, 
this is easy to fix by assuming smoothness properties of / and applying 



Lemma 4.6 As a particular example, suppose that / is Lipschitz with 
Lipschitz constant L, in the sense that 

\f{x, y) - f{x, y')\ < L\\x -x'\\+ L\\y - y'W 

for all X, y, x' , y' G K. 

Theorem 4.8. In the above setting, 

-l/(fc+2) 

MSE(M) < C(K,k,L) , 

where C{K,k, L) is a constant depending only on K, k and L. 
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Proof. Let S = {/3i, . . . , /3n}. Take any 6 > 0. From the Lipschitzness 
condition, it is easy to see that we can find a partition V{6) of S whose size 
may be bounded by C{K, k, L)6~'' , where C{K,k,L) depends only on K, 



k and L. Choosing 6 = n~^^^^'^'^^ and applying Lemma 4.6 completes the 
proof. Note that the exponential term need not appear since the main term 
is bounded below by a positive universal constant if p < n^^/^'^"*"'^^. □ 

4.5. Positive definite matrices. Assume that m = n and M is positive 
semidefinite. (In the statistical context, this is the same as saying that M is 
a covariance matrix. When the diagonal entries are all 1, M is a correlation 
matrix.) 

Completing positive definite matrices with missing entries has received a 
lot of attention in the linear algebra literature [531 ESI ES] , although most 
of the techniques are applicable only for relatively small matrices or when 
a sizable fraction of the entries are observed. In the engineering sciences, 
estimation of covariance matrices from a small subset of observed entries 
arises in the field of remote sensing (see [25 ^ \2Q \ [28] for brief discussions). 

The statistical matrix completion literature cited in Section |3] applies only 
to low rank positive definite matrices. It is therefore quite a surprise that the 
completion problem may be solved for any positive definite matrix whenever 
we get to observe a large number of entries from each row. 

Theorem 4.9. Suppose that m = n and M is positive semidefinite. Suppose 
that p > n'^^^''. Then 

MSE(M) < 4= + C(e)e-^"P, 
y/np 



where C and c are universal constants and C(e) depends only on e. 

Proof. Since M is positive semidefinite, ||-/Vf ||* = Tr(M). Since the entries of 
M are bounded by 1, Tr(M) < n. The proof now follows from an application 
of Theorem 12.11 □ 



What if p is of order 1 /n or less? The following theorem shows that it is 
impossible to estimate M in this situation. 

Theorem 4.10. Given any estimator M, there exists a correlation matrix 
M such that when the data is sampled from M, 

MSE(M) > C(l -p)", 

where C is a positive universal constant. 

Proof. Throughout this proof, C will denote any positive universal constant, 
whose value may change from line to line. 

Let Ui, . . . ,Un be i.i.d. Uniform[0, 1] random variables. Let M be the 
random matrix whose (z, j)th element ruij is equal to UiUj if i ^ j and 1 

i = j. It is easy to verify that M is a correlation matrix. Suppose that 
we observe each element of M on and above the diagonal with probability 
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p, independent of each other. Let D be our data, represented as follows: D 
is a matrix whose (z,j)th element is rriij if the element is observed, and a 
question mark otherwise. 

Now, the probability that no element from the ith row and the ith column 
is observed is exactly equal to (1 — p)". If we do not observe any element 
from the ith row and ith column, we have no information about the value of 
Ui. From this, it is not difficult to write down a formal argument to prove 
that for any j ^ i, 

Var(m,,- | D, {Uk)k^i) = [/|Var([/, | D, {Uk)k^i) > C(l 

If M is any estimator, then rhij is a function of D. Therefore by the above 
inequality and the definition of variance, 

Eiirhij - m^jf I D, {Uk)k^,) > C(l - p)"C/|, 

and thus 

E(m,j-mi,-)'>C(l-p)". 
Since this is true for all i ^ j, the proof is complete. □ 

4.6. Graphons. A graphon is a measurable function / : [0, 1]^ — )• [0, 1] that 
satisfies f{x,y) = f{y,x). The term 'graphon' was coined by Lovasz and 
coauthors in the growing literature on limits of dense graphs [ 20\ [21] \22\ \72\ 
[73]. . Such functions also arise in the related study of weakly exchangeable 
random arrays [121 [9l [6l [59]. They have also appeared recently in large 
deviations [32l [33l [75] and mathematical statistics [30l [86] . 

In the graph limits literature, graphons arise as limits of graphs with 
increasing number of nodes. Conversely, graphons are often used to generate 
random graphs in a natural way. Take any n and let Ui, . . . ,Un be i.i.d. 
Uniform[0, 1] random variables. Construct a random undirected graph on 
n vertices by putting an edge between vertices i and j with probability 
f{Ui, Uj), doing this independently for all 1 < z < j < n. This procedure is 
sometimes called 'sampling from a graphon' (see \21\ Section 4.4]). 

The statistical question is the following: Suppose that we have a random 
graph on n vertices that is sampled from a graphon. Is it possible to estimate 
the graphon from a single realization of the graph? More precisely, is it 
possible to accurately estimate the numbers /(C/j, Uj), I < i < j < n, from 
a single realization of the random graph? The question is similar to the one 
investigated in Section [4.3[ but the difference is that here we are not allowed 
to assume any regularity on / except measurability. 

Taking things back to our usual setting, let M be the matrix whose (i, j)th 
element is f{Ui,Uj). Note that unlike our previous examples, M is now 
random. So the definition of MSE should be modified to take expectation 
over M as well. 

Theorem 4.11. In the above setting, 

MSE(M) < C{f,n), 
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where C{f,n) is a constant depending only on f and n, such that 

lim C(/, n) = 0. 

Proof. Here all entries of the adjacency matrix are visible, so p = 1. Define 
a sequence of functions /i, /2, • ■ • according the following standard construc- 
tion. For each k, let Vk be the /cth level dyadic partition of [0, 1)^, that is, 
the partition of the unit square into sets of the form [{i — l)/2^ ,i/2^) x [(j — 
1) , j /2^). Let fk be the function that is equal to the average value of 
/ within each square of the partition Vk- If ^k denotes the sigma-algebra 
of sets generated by the partition Vk-, then the sequence fk is a martingale 
with respect to the filtration Tk- Moreover, fk = IE(/|J^fc). Finally, ob- 
serve that the sequence fk is uniformly bounded in L^. Combining all these 
observations, it is evident that fk^fi^L^- 

Now fix some e > and an integer n. Take a large enough k = k{e) such 
that 11/ — fk\\L^ < £• Let N be the n x n matrix whose (i,j)th element is 
fk{Ui,Uj)- Then 

n 

E||M - Nfp = nfiUi, Uj) - fkiUi, Uj)f 

<n + n^E{f{Uu U2) - fkiUi, U2)? 

(4) =n + n^\\f - fk\\l,<n + n'e\ 

Now note that if Ui and Uj belong to the same dyadic interval [r/2^, (r -|- 
l)/2^), then the ith and jth rows of N are identical. Hence N has at most 
2^ distinct rows, and therefore has rank < 2^. Therefore by ([3]), 

||iV||* < 2'=/2||iv||^ < 

Consequently, by the inequality ([s]), 

||M||, < \\M -N\\^ + IIA^II* 

(5) <^\\M - N\\F + 2''/^n- 
Combining Q and ^ gives 

E||Af||* < {2^/'^ + l)n + n^/'^e. 
Choosing a sequence going to zero so slowly that 2'=(^")/2 = ©(n-^/^)^ 

we 



can now apply Theorem 2.1 to complete the proof. □ 



4.7. Non-parametric Bradley- Terry model. Suppose there are n teams 
playing against each other in a tournament. Every team plays against every 
other team at least once (often, exactly once). Suppose that pij is the 
probability that team i wins against team j in a match between i and j. 
Then pji = 1 - p^j. 

The Bradley- Terry model [23], originally proposed by Zermelo [101], as- 
sumes that pij is of the form Oj / (oj -|- Uj ) for some unknown nonnegative 
numbers ai, . . . , a^,. It is known how to estimate the parameters ai, . - - ,an 
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if we assume that the outcomes of all games are independent — which, in 
this case, is a reasonable assumption. 

The Bradley- Terry model has found great success among practitioners. 
For an old survey of the literature on the model dating back to 1976, 
see [lU]. Numerous extensions and applications have been proposed, for 
example [Ml SI EH E3 EHIIIZI EO] • The monographs of David and Di- 
aconis [41|, Chapter 9] explain the statistical foundations of these models. 
More recently, several authors have proposed to perform Bayesian inference 
for (generalized) Bradley-Terry models O EQl ED EH [Ml [29] . 

For the basic Bradley- Terry model, it is possible to find the maximum 
likelihood estimate of the Oj's using a simple iterative procedure |10H [6 H [69]. 
The maximum likelihood estimate was shown to be jointly consistent for all 
n parameters by Simons and Yao [93]. 

We now generalize the Bradley- Terry model as follows. Suppose, as be- 
fore, that pij is the probability that team i beats team j. Suppose that the 
teams have a particular ordering in terms of strength that is unknown to 
the observer. Assume that if team i is stronger than team j, then pik > Pjk 
for all k 7^ i,j. Do not assume anything else about the Pi/s; in particular, 
do not assume any formula for the Pij^s in terms of hidden parameters. This 
is what we may call a 'non-parametric Bradley- Terry model'. Note that the 
usual Bradley- Terry model is a special case of the non-parametric version. 

In the non-parametric Bradley- Terry model, is it possible to estimate all 
the Pij^s from a tournament where every team plays against every other 
exactly once? Is it possible to estimate the pij^s if only a randomly chosen 
fraction of the games are played? How small can this fraction be, so that 
accurate estimation is still possible? The following theorem provides some 
answers. 

Theorem 4.12. Consider the non-parametric Bradley- Terry model defined 
above. Let M be the matrix whose {i,j)th entry is pij if i ^ j and if i = j ■ 
Let X be the data matrix whose {i, j)th entry is 1 if team i won over team j, 
if team j won over team i and recorded as missing if team i did not play 
versus team j. Lf team i has played against team j multiple times, let the 
{i,j)th entry of X be the proportion of times that i won over j. (Draws are 
not allowed.) Let all diagonal entries of X be zero. Given p E [0, 1], suppose 
that for each i and j, the game between i and j takes place with probability 
p and does not take place with probability 1—p, independent of other games. 
Let M be the estimate of M based on the data matrix X. Then 



where C is a universal constant. In particular, the estimation problem is 



MSE(M) < 



solvable whenever p^ n 



2/3 



Proof. As usual, throughout the proof C will denote any universal constant, 
whose value may change from line to line. 
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Recall that the definition of the skew-symmetric model stipulates that 
X — M is skew-symmetric, which is true for the non-parametric Bradley- 
Terry model. There is nothing to prove if p < n~^/^, so assume t hat p > 
j^-2/3 xhis allows us to drop the exponential term in Theorem 2.1 and 
conclude that 

(6) MSE(M)<^Ml + ^. 

Let k be an integer less than n, to determined later. For each i, let 

n 

Note that each tj belongs to the interval [0, n]. For I = 1, . . . ,k, let Ti be 
the set of all i such that ti G [n{l — l)/k,nl/k). Additionally, if ti = n, put 
i in Tfc. 

For each I, let r{l) be a distinguished element of T^. For each 1 < i, j < n, 
if z E and j £ T^, let riij := Pr{i)j- Let N be the matrix whose (i, j)th 
element is riij. Note that if G Ti for some /, then riij = rii'j for all j. In 
particular, has at most k distinct rows and therefore has rank < k. Thus 
by inequality ([s]), 

(7) \\N\\^ < Vk\\N\\F <nVk. 

Now take any 1 < i < n. Suppose that i £ Ti. Let i' = r{l). Suppose that 
team i' is weaker than team i. Then piij < pij for all j ^ i, i' . Thus, 

n n n 

^{Pij - nijf = ^{Pij -Pi'jf < J2 \Pv -Pi'jl 
j=i j=i j=i 

(8) <ti-ti, + 2<- + 2<—. 
Similarly, if team i' is stronger than team i, 

(9) ^{Pij - nijf <ti' -ti + 2< ^. 
i=i 

By 0, ^, ^ and ^ we have 

||M||* < ||M - iV||* + ||iV||=, 

< ^/n\\M - N\\f + nVk 



< h nv k. 

k 

Choosing k = [n^/^], we get 

llAflL < Cn'/^. 
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Together with this shows that 

^ -1/3 (~i 

MSE(M) < — + — . 

yf^ np 

Since p > n^^/'^, the second term on the right is redundant. □ 

A natural question is whether the threshold p ^ n~'^l^ is sharp. The 
author does not know the answer, but suspects that the estimation problem 
is solvable all the way up to p S> . 

4.8. Structured matrices. This section will try to justify the claim that 
the USVT procedure works for any matrix that has a 'little bit of structure'. 
For this one needs to defined a notion of structure in a matrix. Our defi- 
nition, informally, is that a square matrix is 'structured' if upon a suitable 
permutation of rows and columns, it resembles a measurable function on the 
unit square. (For now, we will stick to square matrices only. It should not 
be difficult to generalize to rectangular matrices.) 

Let M. be the space of functions / : [0, 1]^ — )■ M such that < 1 

for all X and y. Since the elements of 7W are uniformly bounded, LP norms 
on M are equivalent for all p G (0,oo). We will henceforth consider the 1? 
topology as the natural topology on M. . 

A square matrix M of order n may be canonically represented as an 
element of 7W in the following way. Given M = {mij)Kij<n, define the 
function f^^ on [0, 1]^ as 

y) = rriij if < x < — and < y < — , 

n n n n 

and if either x = or y = 0, let f^{x, y) = 0. With the above convention, 
notice that for any two square matrices M, N of order n, 

1 " 1 
(10) 11/*^ -f^\\l, = - ^ (m., - n.,f = TV[(M - iV)2]. 

Given a matrix M and permutations Tr,a G Sn (where Sn is the group of all 
permutations of {1, . . . ,n}), let nMa be the matrix whose (i,j)th element 

is m^(^i)^(^jy 

Theorem 4.13. Suppose that we have a sequence of parameter matrices 
Mn, where is of order n and symmetric, two sequences of permutations 
TTn and an, where vr„,cr„ G Sn, and a measurable function f : [0,1]^ — )• M, 
such that 

j7T„M„a„ ^ J in as oo. 

For each n, let Xn be a data matrix with mean Mn and assume all the usual 
conditions for the symmetric model. For each n, let Pn be a number such 
that each entry on and above the diagonal of Xn is observed with probability 
Pn and unobserved with probability 1 — pn, independently of other entries. 
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Let Mn be the USVT estimate of Mn computed using the observed entries 
of Xn- Suppose that for all n, pn > n~^^'' for some fixed e > 0. Then 

MSE(Af„) < — + C{e)e-'''P-, 

where c is a universal constant, C{e) depends only on e, and Cn is a sequence 
of constants that do not depend on pn and satisfy liuin—^oo Cn = 0. In 
particular, if pn stays bounded away from zero ( or goes to zero sufficiently 
slowly), then 

lim MSE(M„) = 0. 

n— >oo 

Proof. By Theorem |2.1[ 

MSE(M„) < ^|,f"jh + C(e)e-^"P", 

W^i^/Pn 

where C and c are universal constants, and C(e) depends only on e. There- 
fore it suffices to show that 

hm — = 0. 

n— >oo n I 

Since ||M„||=i, = ||7rnM„a„||* and the USVT procedure is equivariant under 
permutations of rows and columns, it suffices to prove the theorem under 
the assumption that /*^" — )• / in L^. 

Take any two positive integers n and m. Let Mn,m '■= Mn (8> Jm, where 
Jm is the m X m matrix of all I's and denotes the Kronecker product. In 
other words, Mn^m is an n x n array of square blocks of order m, with all 
elements of the (i, j)th block equal to the (i, j)th element of M„. Similarly, 
define Mm,n := M„,® Jn- Note that, as functions on [0, 1]^, f^'^ = f^^^^^ 
and /*^™ = /^-f-.". Thus, 

Let Ai, . . . , A„ be the singular values of arranged in decreasing order. 
The singular values of Mn,m are precisely mXi,mX2, ■ ■ ■ ,mXn and mn — 
n zeroes. Similarly, if 6i,. . . ,6m are the singular values of Mm, then the 
singular values of Mm,n are n6i, . . . , n6m and mn — m zeroes. 

Create a vector of length mn by arranging the singular values of Mn,m in 
decreasing order. Do the same for the singular values of Mm,n- Pair each 
element of the first vector with the corresponding element of the second. 
For 1 < i < n, let //j be the singular value of Mm,n that is paired with the 
singular value mXi of Mn^m- Let Hn+i, ■ ■ ■ , l^'mn be the remaining mn — n 
singular values of Mm,n- Then by a well-known result from linear algebra 
(see Theorem |8.1 1 in Section [83]) and the identities (11) and (10), 



^(mAi-Aii)2+ ^ij <Tr[{Mn,m-Mm,nf] 

i=l i=n+l 

(12) =mV||/^" -/^-lli,. 
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Now note that 

m mn n mn 

i=l i=l i=l i=n+l 



n n mn 

' 'i ■ 



(13) < Aj + ^ |mAi - + ^ /i. 

i=l 1=1 i=n+l 

By the Cauchy-Schwarz inequahty and ( [l2| ), 

1/2 



(14) < mn3/2 II 



i=l ^ i=l 

Again, since there are at most m nonzero /ij's, another apphcation of the 
Cauchy-Schwarz inequality and ( [l2| ) gives 

1/2 



5^ /ii < f m ^ Mi] 

i=?i+l fi=n+l 



(15) <m3/2n||/^^™ -/^"||i2. 

Combining (flsl), (fTl) and (flsl), we get 



m 

n 

i=l j=l 

,3/2 



Dividing both sides by m ' n gives 

\\M^ < + (nVV-V2 + i)||/M„ _ ^M„||^^_ 

m'*'^ riy/ra 

Now let m — >■ oo, keeping n fixed. This gives 

limsupM^<||/M._/||^,. 

The proof is completed by taking n — )• oo on the right. □ 

5. The model-free case: A connection with Szemeredi's lemma 

Let X be an m X n matrix (where m < n) whose entries belong to the 
interval [—1,1]. Let X = YllXi ^iUivf be the singular value decomposition 
of X. Choose some k > and define 

X^ := smvf. 

i : Si>Kn 

In the USVT algorithm for the case of no missing entries, k = 2.02 n~^/^. 

The question is, what does X'^ represent when X is an arbitrary matrix, 
not occurring as a realization of some randomized model? Roughly speaking, 
the answer is as follows: Declare two rows of X to be 'similar' if the two 
corresponding rows of X'^ are close to each other in some reasonable metric. 
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Similarly, call two columns of X 'similar' if the two corresponding columns 
of are close to each other in some reasonable metric. It turns out that 
\i U \s a, bunch of rows of X that are similar to each other, and if y is a 
bunch of columns of X that are similar to each other, then the submatrix 
{xij)i£u,j£V is 'approximately random with independent entries'. To put it 
more succinctly, X looks like a random matrix from the stochastic hlockmodel 
ensemble, with blocks defined using the matrix X'^. 

For a precise understanding of the above paragraph, one needs to under- 
stand the setting of a famous result of Szemeredi, known as the 'regularity 
lemma'. The result may be described as follows. Let G = {V, E) be a simple 
graph, and let Z be subsets of V . Then we denote by e{Y, Z) the number 
of Y-Z edges of G, and define 

Here |y| denotes the size of Y. Given some e > 0, call a pair {A,B) of 
disjoint sets A,B (ZV e-regular if all y C ^ and Z O B with \Y\ > e\A\ 
and \Z\ > e\B\ satisfy 

\piY,Z)-p{A,B)\<e. 

A partition {Vq, . . . ,Vk} of V is called an e-regular partition of G if it 
satisfies the following three conditions: 

(i) l^ol < en; 

(ii) \Vi\ = \V2\ = --- = \Vk\; 

(iii) all but at most eK^ of the pairs {Vi,Vj) with 1 < i < j < K are 
e-regular. 

Szemeredi's regularity lemma goes as follows. 

Theorem 5.1 (Szemeredi's lemma [98j). Given e > and an integer m > 1 
there exists an integer M = M(e, m) such that every graph of order at 
least M admits an e-regular partition {Vq, . . . , Vk} for some K in the range 
m<K<M. 

This result was proved by Szemeredi j98| in 1976 and has found many appli- 
cations in combinatorics, number theory and other areas of discrete math- 
ematics. The version presented above is from Diestel [43| Section 7.2]. We 
will not use this result anywhere in this manuscript; it is stated here simply 
to put the discussion in perspective. 

The definition of an e-regular pair is of relevance to us. When a pair of 
subsets (^4, B) is e-regular, the intuitive meaning is that the edges between 
A and B are 'approximately randomly distributed'. The smaller e is, the 
more random is the distribution of edges. Alternatively, one may say that 
the edges are 'uniformly' or 'regularly' distributed. 
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One may similarly defined a notion of e-regularity for submatrices of our 
data matrix X. For any U C {1, . . . , m} and F C {1, . . . , n}, define 

\u\\v\ ' 

and say that the pair (U, V) is e-regular if for any U' C. U and V' QV with 
\U'\ > e\U\ and \V'\ > e\V\, we have 

\piU',V')-p{U,V)\<e. 

Having defined regularity, let us now define the notion of 'similarity' of rows 
and columns that was alluded to in the beginning of this section. Let xfj 
denote the (i,j)th entry of X'^. Define a distance r'^{i,i') between the ith 
and i'th rows of X as 



Similarly, define a distance c'^{j,j') between two columns j and /. For a 
subset U of {1, ... , m}, define 

r'^iU) := max r'^UA'). 
^ ' i,i'eU ' 



Similarly, for F C {1, . . . , n} define 

c\V) := max c«(i,/). 



Let K be small. If is a set of rows such that r'^{U) is small, then the rows 
in U are 'similar' to each other. The notion of 'similar columns' is defined in 
the same manner using c'^. Note that similarity is defined using X'^, whereas 
the notion of regularity refers to the original data matrix X. 

The following theorem shows that if J7 is a bunch of similar rows and V 
is a bunch of similar columns, and if the sizes of U and V are not too small, 
then the pair (U, V) is e-regular for some small e. The result may be useful 
for clustering rows and columns in real-life applications. 

Theorem 5.2. Take any three numbers k, S and r in the interval (0, 1]. Let 
a := ^/n/m. Let U be a subset of {1, ... , m} and V be a subset o/{l, . . . , n} 
such that r'^{U) < 5 and c'^iV) < 5. Additionally, suppose that \U\ > rm 
and \V\ > rn. Then the pair {U, V) is e-regular, where e = ^/2{Ka + S)/t. 

Proof. Define p'^{U,V) the same way that p{U,V) was defined, but using 
xfj instead of Xij. Take any U' Q U and V' C V such that \U'\ > e\U\ and 
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IE 



i&U',j&V'(^ij 



\U'\\V'\ 

\\x -x'^w 



KTl 



< 



VWWl VWW\ 



< 



\V'\ >e\V\. Then 

\p{U'X)-p^{U',V')\ 

(16) 

Similarly, 

(17) lMU.V)-,'iU,V)l^J\^i^^ , 

Let U" be any subset of U of the same size as U' . Then 



er 



Kn 



< 



Ka 



\p-iu',v')-P%u",v')\ 



lU'WV 



\U"\\V'\ 
6_ 



Similarly, if V" is a subset of V of the same size as V, then 



\p^{U",V') - p^{U",V"'^ 



< 



er 



Combining the last two displays we see that for any U" C U and V" C V 
such that \U"\ = \U'\ and \V"\ = \V'\, 



Ip^iU'y)- p''{U",V")\ < 



26 



er 



Averaging this inequality over all such U" and V" and applying Jensen's 
inequality, we get 

26 



\p-{U',V')-p^{U,V)\< 



er 



Combining this with 06^ and (17) gives 



|p([/',v")-Mf/,v)l<?^^2^ = .. 



er 



This completes the proof. 

6. An impossibility theorem for error estimates 



□ 



Theorem 



2.1 



gives an upper bound on the mean squared error of M. 
The estimate involves the nuclear norm of parameter matrix M. A natural 
question is: Is it possible to estimate the true MSE of M from the data? 

A straightforward approach is to use parametric bootstrap. Having esti- 
mated M using M, one may choose a large number K, generate K copies 
of the data using M as the parameter matrix, compute the estimates M^^\ 
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i = 1, . . . ,K for the K simulations, and estimate the MSE of M using the 
bootstrap estimator 

^ ||M»-M|i2 



MSEbs M = ^ V ^ ^ 



For the vahdity of the bootstrap estimate of the MSE, it is essential that 
the original M is an accurate estimate of M. In other words, we need to 
know a priori that MSE(M) is small to be able to claim that the bootstrap 



estimator of MSE(M) is accurate. Theorem 2.1 implies that if we know that 
||M||=K is small enough from assumptions, this is true. For example, in all 
of the examples from Section |4j it is plausible that the bootstrap method 
may be used to estimate the MSE of M (although the author does not quite 
know how to prove that). 

But is it possible to somehow determine whether MSE(M) is small or not 
from the data, if we do not make any assumption about M to start with? 
We will now show that it is impossible to do so, not only for the estimator 
M but for any 'non-trivial' estimator M. 

The definition of a non-trivial estimator is as follows. Given a parameter 
matrix M and a data matrix X satisfying the conditions of Section [2| the 
trivial estimator of M based on X is simply X itself. We will denote the 
trivial estimator as M^^". Now suppose we are given some estimator M. 
We will say that the estimator M is non-trivial if there exists a sequence of 
parameter matrices M„ and data matrices X„ such that 

MSE(M^^) 7^ as n ^ oo, 

but lim„_!.oo MSE(M„) = 0. In other words, M„ solves a non-trivial estima- 
tion problem. The USVT estimator is clearly non-trivial, as demonstrated 
by the examples from Section |4j 

Suppose that we have a non-trivial estimator M and a procedure P that 
gives an estimate MSEp(M) of the MSE of M. The MSE estimate is com- 
puted using only the data. The procedure will be called 'good' if the follow- 
ing two conditions hold: 

(a) Whenever Mn is a sequence of parameter matrices and Xn is a se- 
quence of data matrices such that MSE(M„) tends to zero, the esti- 
mate MSEp(M„) also tends to zero in probability. 

(b) Whenever M„ is a sequence of parameter matrices and Xn is a se- 
quence of data matrices such that MSE(Mn) does not tend to zero, 
MSEp(M„) also does not tend to zero in probability. 

Theorem 6.1. There cannot exist a good procedure for estimating the mean 
squared error of a non-trivial estimator. 

Proof. Suppose that a good procedure P exists. By the definition of non- 
triviality of the estimator M, there exists a sequence of parameter matrices 
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Mn and data matrices Xn such that 

(18) MSE(M^"^) 7^ as n ^ oo, 
but 

(19) hm MSE(M„) = 0. 

71— >00 

Then by the definition of goodness, 

MSEp(M„) —7-0 in probabihty as n — t- oo. 

Suppose, without loss of generality, that all the data matrices are defined 
on the same probability space. Then taking a subsequence if necessary, we 



may assume that in addition to (18) and il9v, we also have 



(20) P( lim MSEp(M„) = O) = 1. 

n— >oo 

Let := Xn and X'^ := Xn for all n. Consider as a (random) param- 
eter matrix and X'^ as its data matrix. Given M^, the expected value of X!^ 
is Mn, so it is okay to treat as a parameter matrix and X'n as its data 
matrix. We will denote the estimate of constructed using X'n as M^. 
Note that since Mn is random, the mean squared error of M^ is a random 
variable. 

Now, since the estimator is computed using the data matrix only, and 
Xn = Xn, it is clear that = M„. There is no randomness in M^ when 
Mn is given, since X^ = Mn- Thus, if r„ and c„ denote the number of rows 
and columns of M„, then 

mse(m;) = — iim;-m;i|2, 



1 

TnCn 
1 

^nC-n 
1 



\Mn — Xn\\p 



Mn-M^'-'Wl 



> -—\\M:^^- - MnWl WMn-MnWl, 

^^fn^^n TnC-n 

where the last step follows from the inequality (a + 6)^ < 2a? + 26^ and 
the triangle inequality for the Frobenius norm. Taking expectation on both 
sides gives 

E(MSE(M;)) > JmSE(mJ'") -MSE(M„). 



Therefore by ([18|) and (|19|), 

E(MSE(M;)) 7^ as n ^ oo. 
In particular, since mean squared errors are uniformly bounded by 1, 
(21) P(MSE(M;) 7^ as n ^ oo) > 0. 
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Again since MSEp is computed using the data matrix only, therefore for 
all n, 

MSEp(M„) = MSEp(M;). 



Therefore by (20), 



(22) P( lim MSEp(M;) = 0) = 1. 



The equations (21) and (22) demonstrate the existence of a sequence of 
parameter matrices and data matrices such that MSE(M^) -/^ but 
MSEp(M;) in probability. This contradicts the goodness of MSEp. □ 

7. Simulation results 
Although the USVT seems like a useful algorithm in theory, the unspeci- 



fied constant in Theorem 2.1 might make it a little suspicious in the eyes of 
a practitioner. To lay such fears to rest, the author resorted to applying the 
USVT algorithm to simulated data and computing the root mean squared 
errors (RMSE) of the obtained estimates, defined as 

RMSE(M) 



\ mn ^-^ ^-^ 



Actually, what is reported below is not the RMSE, but the following estimate 
of the RMSE: 

/ _rn_ _n_ \ 1/2 

RMSE(M) := ( 

V ran 




It was observed in simulations that RMSE has fairly small fluctuations for 
large matrices, and therefore may be used as a surrogate for the true RMSE. 
Abusing terminology, we will sometimes refer to RMSE as RMSE. 

7.1. Blockmodels. The first batch of simulations investigated applications 
of the USVT algorithm to the stochastic blockmodel. Given two numbers n 
and fc, each node among 1, . . . , n was assigned randomly to one of k blocks. 
A random symmetric k x k matrix P = {pab)i<a,b<k was generated, whose 
elements on and above the diagonal were chosen to be i.i.d. Uniform [0, 1] 
random variables. Given P, an n x n matrix M was defined as follows. 
If a node i belongs to block a and another node j belongs to block b, let 
iTiij = Pab- If ^ = j, let rriij = 0. Given the matrix M, a graph G on n 
vertices was generated to putting an edge between vertex i and vertex j 
with probability rriij, doing this independently for all 1 < i < j < n. The 
adjacency matrix X of G was taken as the data matrix, and the estimate 
M and its root mean squared error were computed. The results for various 
values of n and k are reported below. 
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n 


k = 2 


k = 5 


k = W 


100 


0.065 


0.139 


0.165 


200 


0.070 


0.087 


0.141 


500 


0.050 


0.066 


0.082 


1000 


0.033 


0.045 


0.060 



Table 1: Estimated RMSE in blockmodel simulation 



7.2. Latent space models. The second batch of simulations was done for 
latent space models. Alternatively, this may also be considered as simula- 
tions for graphons. A function / : [0, 1]^ — [0, 1] was chosen, and . . . , /3„ 
were generated as i.i.d. uniform random variables from the interval [0,1]. 
The (i,j)th entry of the matrix M was set to be rriij = f{Pi,(3j). Data 
was generated by choosing Xij to be a 0-1 valued random variable with 
P(xjj = 1) = niij. The root mean squared errors for various choices of / 
and n are reported below. The first two functions are examples of low rank 
models. The third one is a positive definite functions of full rank. The fourth 
one is a logistic model that appears in [31J : it is similar to the Bradley- Terry 
model 123]. 



f{x,y) 


n = 100 


n = 200 


n = 1000 


U^ + y) 


0.095 


0.073 


0.035 


xy 


0.075 


0.053 


0.027 


min{x, y} 


0.097 


0.075 


0.038 


1/(1 + 6-^-2^) 


0.098 


0.069 


0.032 



Table 2: Estimated RMSE in latent space model simulation 



7.3. Covariance matrix completion. The third batch of simulations was 
done for covariance matrix completion. A bunch of number xi, . . . ,Xn were 
chosen as i.i.d. uniform random variables from the interval [0,1], and the 
(i, j)th entry of M was defined as rriij = m.m{xi,Xj}. It is not difficult to 
verify that this is a positive definite matrix. This particular matrix was 
chosen because it is a positive definite matrix that is not of low rank; M has 
rank n with probability 1. A fraction p of randomly chosen elements of M 
was used as data for estimating M. The estimated RMSE for various values 
of n and p are reported below. They seem to be comparable with the upper 



bound of 1/y^np given by Theorem 4.9 



34 



SOURAV CHATTERJEE 



n 


p = 0.5 


p = 0.2 


p= 0.1 


100 


0.067 


0.350 


0.442 


200 


0.048 


0.096 


0.393 


1000 


0.038 


0.059 


0.079 



Table 3: Estimated RMSE in latent space model simulation 



7.4. Distance matrix completion. A fourth batch of simulations was 
done for distance matrices. Just for variety, n numbers xi,...,Xn were 
chosen independently and uniformly at random from the unit cube in M^. 
The {i, j)th entry of M was set to be rriij = d{xi, xj) where d is the Euclidean 
distance divided by \/3 (so that all entries are in [0,1]). A fraction p of 
randomly chosen entries of M were observed. The estimated RMSE for 
various values of n and p are reported below. From the simulations, it does 
seem that a larger sample size is needed for accuracy in dimension three, as 



indicated by the upper bound in Theorem 4.5 



n 


p = 0.5 


p = 0.2 


p= 0.1 


100 


0.102 


0.365 


0.388 


200 


0.094 


0.133 


0.377 


1000 


0.090 


0.117 


0.131 


2000 


0.028 


0.115 


0.125 



Table 4: Estimated RMSE in Euclidean distance matrix simulation 



Euclidean distance matrices are of low rank if the entries are the squared 
distances instead of actual distances. Therefore one may argue that the 
distance matrix completion problem may be solved one of the low rank 
algorithms. However, if one considers the distance instead of Euclidean 
distance, the matrix of distances (or any power of it) is no longer of low rank. 
Simulation results are presented below. Note that here the distance was 
divided by 3, so that all distances are < 1. 



n 


p = 0.5 


p = 0.2 


p= 0.1 


100 


0.091 


0.318 


0.337 


200 


0.094 


0.123 


0.334 


1000 


0.087 


0.114 


0.125 


2000 


0.030 


0.113 


0.121 



Table 5: Estimated RMSE in £ distance matrix simulation 
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7.5. Collaborative filtering. In recommender systems, vendors provide 
recommendations based on ratings submitted by users. Typically, each user 
rates only a few items. A special instance of this problem is the famous 
Netflix problem P] . Users (columns of the data matrix) rate movies (rows 
of the data matrix) on a scale of 1 to 5. The number of available ratings is 
small compared to the number of ratings that are missing. Completing this 
matrix would enable the vendor to recommend movies to users. This is an 
example of a general class of problems known as collaborative filtering [49j . 
A fifth and final batch of simulations was done for a simulated Netflix-type 
problem. In the simulated model, there were 500 movies and 5000 users. 
Each movie belonged to one of k genres and each user belonged to one of 
/ types. A bunch of numbers Pab, a = 1, . . . , fc, b = 1, . . . ,1 were chosen 
independently and uniformly at random from the set {1, 2, 3, 4, 5}. For each 
movie i of genre a and each user j of type b we assumed that j gives rating 
Pab to movie i. Only a fraction p of the ratings, randomly chosen, were 
observed. The estimated RMSE for various values of k, I and p are reported 
below. 



k, I 


p= 0.5 


p = 0.2 


p = 0.1 


2, 3 


0.154 


0.481 


0.502 


4, 10 


0.466 


1.129 


1.226 


10, 30 


0.949 


1.208 


1.315 



Table 6: Estimated RMSE in Netflix-type simulation 



7.6. Further work. The simulations in this section and in Section 2.5 pro- 
vide some evidence of the efficacy of the USVT method in practice. More 
extensive simulations, applications to real data, and numerical comparisons 
with existing techniques have been planned for a forthcoming project. 



8. Proof of Theorem [2H 

8.1. Matrix norms. Let A = (ajj)i<j<m, i<j<n be an m x n real matrix 
with singular values ai, . . . , cr^, where k = min{m, n}. The following matrix 
norms are widely used in this proof. 

The nuclear norm or the trace norm of A is defined as 
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The Frobenius norm, also called the Hilbert-Schmidt norm, is defined as 

/ m n X 1/2 / A: X 1/2 

^j=i j=i ^ ^i=i ^ 

Recall that, as noted in the inequality ([s]), 

ll^ll* < y/rank{A) \\A\\f. 
The sup-norm is defined as 

1 1 1 1 cxD • ~~ UlclX I ^ij I • 

The spectral norm or the operator norm of A is defined as 

ll^ll := max laA. 

l<i<k 

The spectral norm may be alternatively expressed as 
(23) ll^ll = max x'^Ay, 

where S™"^ and S"~^ are the Euclidean unit spheres in M"^ and respec- 
tively. The above representation implies that the spectral norm satisfies the 
triangle inequality. Consequently, for any two m x n matrices A and B, 

\\\A\\ - \\B\\ \ < \\A- B\\ < \\A- B\\f. 

In particular, the spectral norm is a Lipschitz function of the matrix entries 
(with Lipschitz constant 1), if the entries are collectively considered as a 
vector of length mn. 

The triangle inequality for the spectral norm also implies that the map 
j4 I— )• \\A\\ is convex. Indeed, for any < t < 1, 

\\tA + (1 - t)B\\ < t\\A\\ + {1- t)\\B\\. 

For more on matrix norms, see |14| . 

8.2. Perturbation of singular values. The following perturbative result 
from matrix analysis is used several times in this manuscript. Let A and 
B be two m X n matrices. Let k = min{m,n}. Let ai,...,ak be the 
singular values of A in decreasing order and repeated by multiplicities, and 
let Ti , . . . , Tfc be the singular values of B in decreasing order and repeated 
by multiplicities. Let 5i, . . . ,5^ be the singular values of A — B, in any order 
but still repeated by multiplicities. 

Theorem 8.1. For any 1 < p < oo, 

k k 

1=1 i=l 

and 

max IfTj — Tjl < max \5i\. 

l<i<k l<i<k 
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The above result follows, for example, from a combination of Theorem 
III. 4. 4 and Exercise II. 1.15 in [H]. It may also be derived as a consequence 
of Wielandt's minimax principle |14t Section III. 3] or Lidskii's theorem |14l 
Exercise III. 4. 3]. The case p = 2 is sometimes called the Hoffman- Wielandt 
theorem [51 Lemma 2.1.19 and Remark 2.1.20], and the inequality involving 
the maximum is sometimes called Weyl's perturbation theorem Corol- 
lary III.2.6]. 

8.3. Bernstein's inequality. The following inequality is known as 'Bern- 
stein's inequality'. 

Theorem 8.2. Suppose that Xi, . . . ,X„ are independent random variables 
with zero mean, and M is a constant such that \Xi\ < M with probability 
one for each i. Let S := Yll=i -^i ^ ■— Var(S'). Then for any t > 0, 

F{\S\ >t)< 2exp 

^' ' - ^ - Qv + 2Mt 

This inequality was proved by Bernstein |13j . For a discussion of Bern- 
stein's inequality and improvements, see Bennett |12j . 



8.4. Talagrand's concentration inequality. Recall that a median m of 
a random variable 1" is a real number such that ¥(Y < m) > 1/2 and 
P(y > m) > 1/2. The median may not be unique. 

The following concentration inequality is one of the several striking in- 
equalities that are collectively known as 'Talagrand's concentration inequal- 
ities'. 

Theorem 8.3. Suppose that f : [—1, 1]" — )■ M is a convex Lip schitz function 
with Lipschitz constant L. Let Xi, . . . , Xn be independent random variables 
taking value in [—1,1]. Let Y := f{Xi, . . . , X„) and let m be a median ofY . 
Then for any t > 0, 

P(|y-m| >t) <4e-*'/i6^'. 
For a proof of Theorem |8.3| see [99j Theorem 6.6]. 



It is easy to modify Theorem 8.3 to have concentration around the mean 



instead of the median. Just observe that by Theorem 8.3 , E(y— m)^ < 64L^. 
Since E(y - m)^ > Var(y), this shows that Var(y) < 64L2. Thus by 
Chebychev's inequality, 

p(|y-E(y)| > 16L) < ^. 

By the definition of a median, this shows that E(y) — 16L < m < E(y)+16-L. 
Together with Theorem |8. 31 , this implies that for any t > 0, 

(24) P(|y - E(y)| > 16L + t)< 4e~*'/2^'. 



The above inequality has a number of uses in the proof of Theorem 2.1 
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8.5. Spectral norms of random matrices. The following bound on spec- 
tral norms of random matrices is a crucial ingredient for this paper. The 
proof follows from a combinatorial argument of Vu |lUUj (which is itself a 
refinement of a classical argument of Fiiredi and Komlos |47j ) , together with 
Talagrand's inequality (24). 

Theorem 8.4. Take any two numbers m and n such that 1 < m < n. Sup- 
pose that A = {aij)i<i<rn,i<j<n is a matrix whose entries are independent 
random variables that satisfy, for some E [0, 1], 

E(ajj) = 0, lE(aij) < c^, and \aij\ < 1 a.s. 

Suppose that a"^ > n"^^'' for some e > 0. Then 

\A\\ > 2.0lCTVn) < C7i(e)e"^2<x2n^ 



where Ci(e) is a constant that depends only on e and C2 is a positive uni- 
versal constant. The same result is true when m = n and A is symmetric 
or skew- symmetric, with independent entries on and above the diagonal, all 
other assumptions remaining the same. Lastly, all results remain true if the 
assumption > n"^"*"^ is changed to > n~^(logn)^^^. 

Proof. First assume that m = n and A is symmetric. Note that for any even 
number k, 

(25) Epf < E(Tr(A'=)) = na^,^.a^,^, ■ ■ ■ a^,_,^,a,,,,). 

l<ji,...,is;<n 

Consider ii,i2, • • • , ifc_i, ifc, ii as a closed tour of a subset of the vertices of 
the complete graph on n vertices (with self-edges included). From the given 
assumptions about the a^j's, it follows that the term ^{aij^i^ai^ig ■ ■ ■ ai^i^) is 
zero if there is an edge that is traversed exactly once. Suppose that each 
edge in the tour is traversed at least twice. Let p be the number of distinct 
vertices visited by the tour. Then the number of distinct edges traversed by 
the tour is at least p — I. Since o"^ < 1, \aij\ < 1, and IE|ajj|' < o"^ for any 
/ > 2, this shows that 

(26) |E(ojii2ai2i3 • • • flifcii)! < CT^^"^. 

Thus, if W{n,k,p) is the number of tours of length k that visit exactly p 
vertices and traverse each of its edges at least twice, then 

k 

(27) E|| < (r^^~^W{n, k,p). 

p=i 

Vu |100^ equation (5)] proves that if p > k/2 then W{n,k,p) = and if 
p<k/2 then 

W{n,k,p) <n{n-l)---{n-p+l)( ^ V2(fc-2p+2)22p-2^ 

\2p - 2 J 
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Using this bound, one can proceed as in [100^ Section 2] to arrive at the 
conclusion that if k is largest even number < a^/^-n}/^ , then 

Epf < 2n(2aVn)^. 

Consequently, 

This shows that if cj^ > n^"*^^*^ (or if cj^ > n^^(logn)®"*''^), then there is a 
constant C{e) depending only on e such that if n > C(e) then 

(28) EPIl < 2.005o-Vn. 



Since aij are independent and \aij\ < 1 almost surely for all and the 
spectral norm is a convex Lipschitz function of matrix entries with Lipschitz 



constant 1 (by the discussion in Section 8.1), therefore one can apply Tala- 



grand's inequality (Theorem 8.3 and inequality (|24[)) together with (28) and 
the assumption that cr^ > n^^'' to conclude that there is a constant C(e) 
such that if n > C(e) then 

(29) F{\\A\\ > 2maV^) < Cie'^^"^'', 

where Ci and C2 are positive universal constants. Replacing Ci by a large 
enough constant Ci(e), the condition n > C(e) may be dropped. It is clear 
from the argument that it goes through in the skew-symmetric case as well. 
Let us now drop the assumption of symmetry, but retain the assumption 



that m = n. Let a^^ := aji. Then the inequality (25) must be modified to 
say that for any even k, 



Epf < ^Ty{{A^ Af/^)) 



E 



l<ii,...,ik<n 



As before, the term inside the sum is zero for any tour that traverses an 
edge exactly once. (In fact, there are more terms that are zero now; a term 
may be zero even if a tour traverses all of its edges at least twice.) Similarly, 



the inequalities ([26j) and (27) continue to hold and therefore so does the rest 
of the argument. 

Lastly, consider the case m < n. Augment the matrix A by adding an 
extra n — m rows of zeros to make it an n x n matrix that satisfies all the 
conditions of the theorem. Clearly, the new matrix has the same spectral 
norm as the old one. This completes the proof. □ 

8.6. The estimation lemma. Suppose that A and B are two m x n ma- 
trices, where m < n. Let aij be the (i, j)th entry of A and bij be the (i, j)th 
entry oi B. It is easy to see from definition that 
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Thus if 11^ — II is small enough, then the entries of A are approximately 
equal to the entries of B, on average. In other words, the matrix A is an 
estimate of the matrix B. 

The goal of this section is to show that if in addition to the smallness of 
11^ — -B II, we also know that the nuclear norm ||-B||* is not too large, it is 
possible to get a better estimate of B based on A. 

Lemma 8.5. Let A = "^2^1 '^i^iVi ^he singular value decomposition 
of A. Fix any 5 > and define 

B := ^ a-iXivf. 

i:a^>{l+S)\\A-B\\ 

Then 

\\B - B\\f < K{6){\\A - B\\\\B\Uy^\ 
where K{5) = (4 + 25)^/2/5 + ^/2TI. 

Proof. Let B = YllLi '^i'^i'^T be the singular value decomposition of B. 
Without loss of generality, assume that iTj's and r^'s are arranged in de- 
creasing order. Let S be the set of i such that ai > {1 + 5)\\A — B\\. Define 

G := ^ TiUivf. 

Note that by the definition of B, the largest singular value of ^ — -B is 
bounded above by (1 + 5)||^ — i?||. In other words, 

(30) \\A- B\\ < {1 + 6)\\A- B\\. 



On the other hand, by Theorem 8.1 



max |(Tj — Ti\ < 11^4 — i?||. 

l<i<m 

In particular, for i ^ S, 

(31) Ti<ai + \\A-B\\<{2 + 5)\\A-B\\, 
and for i £ S, 

(32) Ti>ai-\\A-B\\>6\\A-B\\. 



By (I31j), 

(33) \\B -G\\ <{2 + S)\\A- B\ 



By (30) and (33), we have 

(34) \\B - G\\ < \\B - A\\ + \\A - B\\ + \\B - G\\ < (4 + 26)\\A - B\\. 

Since B and G both have rank < l^l, the difference B — G has rank at 
most 2|5|. Using this and (p4|), we have 



(35) \\B-G\\f < y2|^||^-G|| < {4 + 26)^/2\S\\\A- B\\. 
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Next, observe that by ( [3l| ), 
(36) ||S-G||| = ^r2<(2 + 



\A- 



B\\Y,^i < (2 + 



\A-B\\\\B\ 



Combining (35) and (36), we have 
\\B 

(37) 



B\\F — W-B — G\\p + \\B — G\\p 

< {4 + 26)y^\\\A- B\\ + {{2 + 6)\\A - B\\\\B\ 



,1/2 



Next, note that by (32), 



l^ll* >^ri>6\S\\\A-B\ 



and thus 
(38) 



1^1 < 



\B\ 



6\\A-B\ 



Combining (37) and (38), the proof is complete. 



□ 



8.7. Finishing the proof of Theorem 2.1 We will prove the theorem 
only for the asymmetric model. The only difference in the proofs for the 
symmetric model and the skew-symmetric model is that we need to use 
the symmetric and skew-symmetric parts of Theorem 8.4 instead of the 
asymmetric part. 

Throughout this proof, C(e) will denote any constant that depends only 
on e, and C and c will denote universal constants. The values of C(e), C 
and c may change from line to line or even within a line. 

Note that for all i and j, 



prriij, 



and 
(39) 



\ai{yij) < K(4) = pE{xfj) < p. 

Let r]ij be the indicator that the (z,j)th entry is observed. Then r/jj's are 
independent Bernoulli random variables with expected value p, and 



mn 

i=i j=i 

Define two events Ei and E2 as 

El :={\\Y -pM\\< 2.01^}, 
E2 := {\p-p\ < O.OOlp}. 



By Theorem 8.4 
(40) 



P(^i) > 1 - C{e)e' 



cnp 
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By Bernstein's inequality (Theorem 8.2), for any t>0 



^{\p-p\ >t)< 2exp( 



V 6p + 2t 
In particular, 

(41) P(^2) > 1 - 2e^'='""P. 

Let 6 be defined by the relation 

(1 + 5)\\Y - pM\\ = 2.02 
If El and E2 both happen, then 



^ 2.02v/^ 2.02V0.999np 
1 + 5 > ^ J_ > — _ ^ > 1.004. 



2.0iy^ 



2mJrvp 



Let K{5) be the constant in the statement of Lemma 8.5 It is easy to see 
that there is a universal const ant C such that if 5 > 0.004, then K{5) < 
C\/l + 5. Therefore by Lemma 8.5, if Ei and E2 both happen, then 



(42) 



\\pW-pM\\l < Cil + 5)\\Y -pM\\\\pM\U 

< CyQiWpMlU 

< Cn^/^p^/^\\M\\^. 



rriijl < Iwij 



By the definition of M, it is obvious that |m. 
and j . Together with ( 42 ) , this shows that under Ei E2, 

|2 



p'^WM- M\\l <p'^\\W- 



M 



< Cp'^\\W-M\\l 

< C\\pW - pMfp + c{p - pf\\M\\l 

< Cn^/^p^/'^\\M\U + C{p - pfmn. 



Note that E{p - p)'^ = p/mn and that \\M - M|||, < 4mn. Thus by (|40]) 
and (|4l]), 

E||M - Mfp < Cn^/V^'^^ll^ll* + Cp~^ + Cmn{l - ¥{Ei n E2)) 
< Cn^/V^^^ll^ll* + Cp-^ + C{e)mne- 
Dividing throughout by mn, we arrive at the inequality 



■cnp 



(43) 



-EIIM - M\\i < 



mn 



mwnp np 



The next goal is to show that 



(44) 



-E||M-M||i < 



mn 



C\\M\ 



mn 
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First, suppose that ||M||=i, > O.OOly/n/p. Then 

||M||, 1 ^ C\\M\\l 
" + — < — — —, 
rriyjnp np mn 

and so (44) follows from (43). Therefore assume that ||M||=, < Omi^/n/p. 
Then in particular, ||M|| < O.OOly^n/p. Therefore if Ei r\E2 happens, then 



\\Y\\ < \\Y -pM\\ + \\pM\\ < 2.011^/np < l.OUy'np. 

This implies that there is no singular value of Y that exceeds 2.02\/np, and 
therefore M = 0. Consequently, 



\M -Ml 



\M\\i < \\M\ 



Thus, if ||M||* < O.OOiy^, then by ^ and 



-E\\M- M\\i < 



\M\ 



+ C7(l - P(^i n ^2)) < 



\M\ 



mn 



+ C(e)e- 



cnp 



mn mn 
This completes the proof of Theorem |2.1[ As an end-note, observe that if 
\ai{xij) < (7^ for all then the estimate (39) may be improved to 

Var(yjj) = pVar(xjj) + p{l — p)ni^j < max {ph + p{l —p)a), 

where R is the quadrilateral region 

{(a, b) : < a < 1, < b < a'^ , < a + b < 1}. 

The maximum must be attained at one of the four vertices of R. An easy 
verification shows that actually, the maximum is always attained at the 
vertex (1 — cr^, cj^), which gives the upper bound Var(?/ij) < q := pa"^ — 
p){l — fJ^). This allows us to replace the threshold 2.02-y/np by 2.02y^ng, 
where q = pa"^ + p{l — p){l — a'^) . As before we need that q > n^^"*"*^. 

9. Proof of Theorem [2?2] 

Throughout this proof, C will denote any positive universal constant, 
whose value may change from line to line. 

Take any 6 G [0, m-^/n] and let 9 := 5/{m^/n). We will first work out 
the proof under the assumption that p < 1/2. Under this assumption, three 
situations are considered. First, suppose that 

(45) 6/^ < 1 and mO^/p > 1. 

Let k := [m6y/p]. Clearly, k < m. Let M be an m x n random matrix 
whose first k rows consist of i.i.d. Uniform[— 1, 1] random variables, and 
copy this block [1/p] times. This takes care of k[l/p] rows. (This is okay, 
since k/p < mO/^Jp < m by (45).) Declare the remaining rows, if any, 
to be zero. Then note that M has rank < k < mO^/p. Therefore by the 
inequality ([3]), 

||M||* < {mO^f^WMWr < {mQ^fl'^imnQj^fl'^ = m^fRd. 
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Let X = M. Let D be our data, that is, the observed values of X. One can 
imagine D as a matrix whose (z,j)th entry is Xij if Xij is observed, and a 
question mark if Xij is unobserved. For any (^,J) belonging to the nonzero 
portion of the matrix M, M contains copies of rriij. Since the X- value 
at the location of each copy is observed with probability p, independent of 
the other copies, and p < 1/2, therefore the chance that none of these copies 
are observed is bounded below by a positive universal constant. If none of 
the copies are observed, then the data contains no information about m,j. 
Using this, it is not difficult to write down a formal argument that shows 

E(Var(mij | D)) > C. 

On the other hand, since rhij is a function of D, the definition of variance 
implies that 

E((mjj — rriij)'^ \ D) > \ai{mij \ D). 
Combining the last two displays, we see that 

k[l/p] n 



E||M-M|||, > ^ 



Tfiij Tflij ^ 



^2 



1 3=1 

CmnO 



(46) >Ck[l/p\n> 



The argument that led to the above lower bound is a typical example of 
the classical Bayesian argument for obtaining minimax lower bounds, and 
will henceforth be referred to as the 'standard minimax argument' to avoid 
repetition of details. 



Since mOyJp > 1, it follows from (46) that 



9 1 1 

— > — > — . 
y/p mp np 



Thus, under (45), there exists M with ||M||* < S such that 



MSE(M) > ^ + — . 

^ np 

Next, assume that 

(47) 9/^ < 1 and m9^ < 1. 

Let M be an m X n matrix whose first row consists of i.i.d. random variables 
uniformly distributed over the interval [—m9^/p,m9y/p], and this row is 
copied [1/p] times, and all other rows are zero. Then M has rank < 1, and 
therefore by ([s]), 

\\M\\^ < \\M\\f < mB^J- = m9y/n. 
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On the other hand, a standard minimax argument as before imphes that for 
any estimator M, 



EIIM - Mill > {mOJpf- = nm^^^ 

P 



In particular, under (47), there exists M with ||Af||* < 6 such that 

MSE(M) > . 

mn 

Lastly, suppose that 

(48) > 1. 

Let M be an m X n matrix whose first [mp] rows consist of i.i.d. random 
variables uniformly distributed over [—1,1], and this block is copied [l/p\ 
times. Then the rank of M is < [mp], and so by (48) and ([S]), 



< mJnp < 9my/n. 



Again by a standard minimax argument, it is easy to conclude that for any 
estimator M, there exists M with ||M||* < 5 such that 

MSE(M) > C. 

This completes the proof when p < 1/2. Next suppose that p > 1/2. The 
only place where the assumption p < 1/2 was used previously was for proving 
that E(Var(m.y|D)) > C. This can be easily taken care of by inserting some 
randomness into the data matrix X, as follows. First replace M by in 
all three cases above. This retains the condition ||M||* < 5. Given M, let 
X be the data matrix whose (i, j)th entry Xij is uniformly distributed over 
the interval [mij — 1/2, mjj + 1/2], whenever is the 'main block' of M; 
and this value of Xij is copied [1/p] times in the appropriate places. 

Since the entries of M are now guaranteed to be in [—1/2,1/2], this 
ensures that the entries of X are in [—1,1]. Now note that even if Xij 
or one of its copies is observed, it gives only limited information about mij. 
In particular, it is easy to prove that E(Var(mjj|D)) > C and complete the 
proof as before. 



This complete the proof of Theorem 2.2 for the asymmetric model. For 
the symmetric model, simply observe that the singular values of any square 
matrix M are the same as those of the symmetric matrix 

M 
M'^ 

with multiplicity doubled. It is now clear how the minimax arguments for 
the asymmetric model may be carried over to the symmetric case by consid- 
ering the same Bayesian models for M and working with the corresponding 
symmetrized matrices. For the skew-symmetric case, replace the by 
— in the above matrix. 
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10. Proof of Theorem 



Theorem 2.3 follows from a simple application of the estimation lemma 
(Lemma 8.5). In the estimation lemma, take A = p^^Y, B = M, and 
B = W, where W is defined as in the USVT algorithm, with p = p since p 
is known. Let 5 be as in the estimation lemma, i.e. 5 solves 

(i + (5)||p-^y-M|| = 1.017. 

Then, if \\p-^Y - M\\ < 7, 

I.OI7 

5 = - — - 1 > O.OL 

\\p-^Y - M\\ 

There is a universal constant C such that 6 > 0.01 implies K{6) < C\/l + 6. 
(As usual, C will henceforth denote any universal constant, whose value may 
change from line to line.) Therefore if — M\\ < 7, then 

\\M-Mfp < \\W- M\\l 

< K{6f\\p-^Y - M\\\\M\U 

< C(l + 5)||p~^y-M||||M||, 

< C7IIMII,. 

Consequently, \\p~'^Y - M|| < 7 implies ||M - M|||, < C7||M||=^. On the 
other hand, if ""^y—MH > 7, we have the trivial bound ||M— M|||, < 4mn. 
Combining, we have 

MSE(M) < + Ce. 

mn 



This completes the proof of Theorem 2.3 



11. Discussion and summary 

This paper introduces a new technique for estimating the entries of a large 
matrix when the data is a randomly chosen subset of the original entries, 
possibly with corrupted by noise. There is a large body of literature devoted 
to this general problem that is summarized in Section [3j The method of 
this paper works by first scaling the data matrix so that all entries are in 
[—1,1], then taking the singular value decomposition of the matrix (with 
zeros substituted for missing values) and then truncating the SVD at the 
universal threshold of 2.02\/np, where p is the proportion of observed values 
and n is the maximum of the number of rows and the number of columns. In 
the final step, the matrix is scaled up by a factor of 1 /p and the entries that 
have absolute value greater than 1 are substituted by 1 or —1, depending 
on whether they are greater than 1 or less than —1. 

Due to the universality of the threshold, the author calls it the Universal 
Singular Value Thresholding (USVT) algorithm. The USVT method gives 
a good estimate when the unknown matrix has a 'little bit of structure', 
where structure is measured by the nuclear norm. Incidentally, the idea of 
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estimating matrices with low nuclear norm has also appeared very recently 
in a paper of Davenport et. al. [38j. 

As is clear from the above description, the USVT algorithm is very sim- 
ple and fast to implement, with the potential for application to extremely 



large data matrices. In Theorem 2.2, the USVT estimator is shown to be 



minimax optimal up to a constant factor in the risk. The algorithm works 
for matrices that are not necessarily of low rank. On the flip side, the algo- 
rithm only gives approximate recovery as opposed to the exact recovery of 
low rank matrices (under the additional assumption of incoherence) given 
by the convex relaxation methods of Candes and coauthors |26l [25l [28l . 

Singular value thresholding has been used before [101 H] aiid especially in 
the important recent papers of Keshavan et. al. |651l66j: the main new contri- 
bution of this paper is the universal threshold of 2.02^/np, which eliminates 
the need for any structural information about the unknown matrix such 
as rank, etc. The available SVD based algorithms require the underlying 
matrix to be of low rank and that the rank be known a priori. 

The other new contribution of the paper is the solution of many con- 
crete matrix estimation problems in Section |4j including problems related 
to stochastic blockmodels, distance matrix completion, latent space models, 
positive definite matrix completion, graph limits and generalized Bradley- 
Terry models. Many of these problems involve matrices that are not of low 
rank and are not known to be solvable by any existing method. Simulation 
results from Section 2.5 and Section [7] indicate that the method may work 
quite well in practice. Further simulations and numerical comparisons with 
popular methods have been planned for a forthcoming project. 
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